Transforming Features





Kerry Back

Outliers, scaling, and polynomial features

  • For neural nets and other methods, it is important to have predictors that are
    • on the same scale
    • free of outliers
  • It is also useful to add squares and products of our predictors.
  • We will (i) take care of outliers and scaling, (ii) add squares and products, and (iii) define a machine learning model all within a pipeline.

Neural net example of scaling and outliers

For a neuron with

\[ y = \max(0, b + w_1x_1 + \cdots + w_n x_n)\]

  • to find the right \(w\)’s, it helps to have \(x\)’s of similar scales
  • and outlier inputs can produce outlier outputs and large errors, so they may get excessive attention in fitting the model
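To make this concrete, here is a minimal sketch of a single ReLU neuron with hypothetical weights and inputs (not fitted values): a feature on a much larger scale dominates the output, and an outlier input becomes an outlier output.

```python
import numpy as np

# One ReLU neuron: y = max(0, b + w1*x1 + ... + wn*xn)
def relu_neuron(w, b, x):
    return max(0.0, b + float(np.dot(w, x)))

w = np.array([1.0, 1.0])  # illustrative equal weights
b = 0.0

x_typical = np.array([0.5, 1000.0])    # features on very different scales
x_outlier = np.array([0.5, 100000.0])  # one outlier in the large-scale feature

print(relu_neuron(w, b, x_typical))  # 1000.5 -- the large-scale feature dominates
print(relu_neuron(w, b, x_outlier))  # 100000.5 -- outlier in, outlier out
```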

Quantile transformer

There are many ways to take care of outliers and scaling, but we’ll just use one.

from sklearn.preprocessing import QuantileTransformer

transform = QuantileTransformer(
    output_distribution="normal"
)
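As a quick sanity check (on synthetic lognormal data, not the database), the transformer maps a heavily right-skewed sample with outliers to something close to standard normal:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=(1000, 1))  # right-skewed, with outliers

qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000)
z = qt.fit_transform(x)

print(round(float(z.mean()), 2), round(float(z.std()), 2))  # roughly 0 and 1
```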

Example: roeq in 2021-01

[Figure: distribution of roeq before (old) and after (new) the quantile transformation]

Pipelines

  • This will be our process:
    • Apply quantile transformer
    • Add squares and products
    • Apply quantile transformer again
  • We do this and define our ML model in a pipeline.
  • Then we fit the pipeline and predict with it.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

poly = PolynomialFeatures(degree=2)
pipe = make_pipeline(
  transform, 
  poly,
  QuantileTransformer(
      output_distribution="normal"
  ),  # fresh instance: a Pipeline doesn't clone its steps, so reusing transform would refit it in place and break the first step at prediction time
  model
)
pipe.fit(X, y)

Entire workflow: connect to database

from sqlalchemy import create_engine
import pymssql
import pandas as pd

server = "mssql-82792-0.cloudclusters.net:16272"
username = "user"
password = "" # paste password between quote marks
database = "ghz"

string = f"mssql+pymssql://{username}:{password}@{server}/{database}"

conn = create_engine(string).connect()

Download data

data = pd.read_sql(
    """
    select ticker, date, ret, roeq, mom12m
    from data
    where date='2021-01'
    """, 
    conn
)
data = data.dropna()
data['rnk'] = data.ret.rank(pct=True)
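The rnk column is each stock's percentile rank of return within the month, which is what the model will predict. On a small illustrative series:

```python
import pandas as pd

ret = pd.Series([0.05, -0.02, 0.10, 0.01])
print(ret.rank(pct=True).tolist())  # [0.75, 0.25, 1.0, 0.5]
```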

Import from scikit-learn

from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

Define pipeline

transform = QuantileTransformer(
    output_distribution="normal"
)
poly = PolynomialFeatures(degree=2)
model = MLPRegressor(
  hidden_layer_sizes=(4, 2),
  random_state=0
)
pipe = make_pipeline(
  transform, 
  poly,
  QuantileTransformer(
      output_distribution="normal"
  ),  # fresh instance: a Pipeline doesn't clone its steps, so reusing transform would refit it in place and break the first step at prediction time
  model
)

Fit and save the pipeline

X = data[["roeq", "mom12m"]]
y = data["rnk"]

pipe.fit(X, y)

from joblib import dump, load
dump(pipe, "net2.joblib")


Later:

net = load("net2.joblib")
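As a self-contained sketch of the save/load round trip (synthetic data standing in for the database download, and a temporary file in place of net2.joblib), the loaded pipeline reproduces the fitted pipeline's predictions exactly:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # synthetic stand-ins for roeq and mom12m
y = rng.uniform(size=200)      # synthetic stand-in for the return rank

pipe = make_pipeline(
    QuantileTransformer(output_distribution="normal", n_quantiles=200),
    PolynomialFeatures(degree=2),
    QuantileTransformer(output_distribution="normal", n_quantiles=200),
    MLPRegressor(hidden_layer_sizes=(4, 2), random_state=0, max_iter=500),
)
pipe.fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "net2.joblib")
dump(pipe, path)
net = load(path)

same = bool(np.allclose(pipe.predict(X), net.predict(X)))
print(same)  # True
```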

Comments

  • The workflow is the same for a random forest: just import RandomForestRegressor and assign it in model = instead.
  • In the next section, we'll change the last block: we'll wrap the pipeline in GridSearchCV and fit that instead.